-
Notifications
You must be signed in to change notification settings - Fork 109
Add support for tracking parent of CodebaseResource entries and ensure top level paths are stored #1691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add support for tracking parent of CodebaseResource entries and ensure top level paths are stored #1691
Conversation
4ee0dfc
to
8614382
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR introduces an ancestor field to the CodebaseResource model to optimize fetching children of a directory. The existing .children() method which was intended to fetch children uses regex filtering on path, which is slow for large codebases.
Additionally, this PR also ensures that top-level paths are stored during the resource creation step which is essential for rendering the root level nodes in the file tree
Could you explain and document, with data and examples, how storing the parent_path
on the resource, would be used in the rendering context?
No need to add code, just as discussion in the PR as a start would be useful.
Should we deprecate or remove the following methods now that ancestor handles relationship resolution more efficiently?
Let's wait a bit to ensure we have a working system before deprecating existing code ;)
See my comment about the code.
It is missing unit test to ensure the new field is properly set (with regular path, and with root).
Also a unit test to cover the changes made to rootfs.py will be appreciated.
Use of
|
In preparation of adding parent_path as a field #1691 Signed-off-by: tdruez <tdruez@nexb.com>
@aayushkdev Thanks for all the implementation details! I've renamed the old |
Hey @tdruez I have merged your changes and renamed the required fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aayushkdev Looking good, see my suggestion for improvements.
A data migration could solve this. The question is how fast could we migrate the data on existing databases that include millions of CodebaseResource records. |
Yes, iterating through all the resources in CodebaseResource to add the parent_path value to each resource would be quite slow, especially with multiple projects. If you'd like, I can prepare a migration script to test this out and see how it affects performance. |
I don't think a "script" is the right approach here. We need to look into a raw SQL approach for acceptable performances. |
Yeah, I think this will be doable. I’ll explore this further and test it out. |
Hey @tdruez, I’ve got the migration for setting |
Why would you have to insert those field? We are talking about generating a value for
Can you paste your progress on this migration in a comment, it'll be easier to provide you feedback. |
This migration has two parts first, adding the parent_path field to each resource, which I’ve already completed and second, inserting new resources with top-level root paths into CodebaseResource for each project. This is necessary because pipelines like
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('scanpipe', '0073_codebaseresource_parent_path'),
]
operations = [
migrations.RunSQL(
"""
UPDATE scanpipe_codebaseresource
SET parent_path = regexp_replace(path, '/[^/]+$', '')
WHERE path LIKE '%/%';
"""
),
] |
Hey @tdruez @pombredanne , I explored different ways to store trees (like MPTT, Treebeard and ltree), but they all involve more complex integration to scancode.io, introduce even more additional fields, or require us to change the way Resources are created which slows down creation of resources, all while offering a minimal performance benefit over the I also tested the So I don't think its necessary to change our approach on how we handle the storage of trees. |
aboutcode-org#1694) In preparation of adding parent_path as a field aboutcode-org#1691 Signed-off-by: tdruez <tdruez@nexb.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
…ource Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
…h` is always set Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
…one to align with the code format Signed-off-by: Aayush Kumar <aayush214.kumar@gmail.com>
7928f3c
to
2b5b2f7
Compare
fix #1687
Summary/Goals of this change
This PR introduces an
parent_directory_path
field to theCodebaseResource
model to optimize fetching children of a directory. The existing.children()
method which was intended to fetch children uses regex filtering on path, which is slow for large codebases.Additionally, this PR also ensures that top-level paths are stored during the resource creation step which is essential for rendering the root level nodes in the file tree
Breaking Changes
The file tree will rely on the newly introduced ancestor field and the presence of top-level paths. As a result, existing projects scanned before this PR will need to be re-scanned to work with the file tree view.
Open Questions
Should we deprecate or remove the following methods now that ancestor handles relationship resolution more efficiently?
Would love feedback on this!